log(carat) vs relative price represent? What does the diagonal line without any points represent?
Consider the linear relationship between lcarat and lprice:
diamonds2 <- diamonds %>%
  mutate(
    lcarat = log2(carat),
    lprice = log2(price)
  )
diamonds2
## # A tibble: 53,940 x 12
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## # ... with 53,930 more rows, and 2 more variables: lcarat <dbl>,
## # lprice <dbl>
ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_bin2d() +
  geom_smooth(method = "lm", se = FALSE, size = 2, color = "yellow")
We can see above that the x-axis extends further to the right than in the original plot (since the original plot was limited to diamonds of two carats or fewer). The range of lprice did not increase.
mod <- lm(lprice ~ lcarat, data=diamonds2)
diamonds2 <- diamonds2 %>% mutate(rel_price = resid(mod))
diamonds2
## # A tibble: 53,940 x 13
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## # ... with 53,930 more rows, and 3 more variables: lcarat <dbl>,
## # lprice <dbl>, rel_price <dbl>
ggplot(diamonds2,aes(carat,rel_price)) + geom_bin2d()
The above plot is not good: it shows a distinct pattern in the residuals. The linear model overpredicts the log price for large carats, and the variance of the residuals is significantly higher for smaller carats. Looking back at the picture of the linear model, the fitted line extends beyond the range of lprice for the largest carats, so there are only points below the line in that region (which is why we see only negative residuals there).
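To make the overprediction concrete, we can summarize the residuals within a few hand-picked carat bins (a quick sketch; the bin breaks are arbitrary):

```r
diamonds2 %>%
  mutate(carat_bin = cut(carat, breaks = c(0, 1, 2, 3, 6))) %>%
  group_by(carat_bin) %>%
  summarize(
    mean_resid = mean(rel_price),  # should hover near 0 if the fit is unbiased
    sd_resid = sd(rel_price),
    n = n()
  )
```

If the model were unbiased across the carat range, the mean residual would stay near zero in every bin rather than drifting negative for the largest diamonds.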
color_cut <- diamonds2 %>%
  group_by(color, cut) %>%
  summarize(
    price = mean(price),
    rel_price = mean(rel_price)
  )
color_cut
## # A tibble: 35 x 4
## # Groups: color [?]
## color cut price rel_price
## <ord> <ord> <dbl> <dbl>
## 1 D Fair 4291.061 -0.06695865
## 2 D Good 3405.382 -0.03902942
## 3 D Very Good 3470.467 0.10763401
## 4 D Premium 3631.293 0.11368869
## 5 D Ideal 2629.095 0.21605978
## 6 E Fair 3682.312 -0.16154694
## 7 E Good 3423.644 -0.04676044
## 8 E Very Good 3214.652 0.06901578
## 9 E Premium 3538.914 0.08742779
## 10 E Ideal 2597.550 0.17323162
## # ... with 25 more rows
ggplot(color_cut, aes(color, price)) +
geom_line(aes(group=cut), color="grey80") +
geom_point(aes(color = cut))
Above we see the same trend as in the book, where the lowest-quality diamonds (color = J) have the highest price, especially for Premium cut diamonds. Plotting relative price against color solves the issue.
ggplot(color_cut, aes(color, rel_price)) +
  geom_line(aes(group = cut), color = "grey80") +
  geom_point(aes(color = cut))
ggplot(diamonds, aes(color, carat)) + geom_boxplot()
color_cut_clarity <- diamonds2 %>%
  group_by(color, cut, clarity) %>%
  summarize(
    price = mean(price),
    rel_price = mean(rel_price)
  )
ggplot(color_cut_clarity, aes(color, rel_price)) +
  geom_line(aes(group = cut), color = "grey80") +
  geom_point(aes(color = cut)) +
  facet_wrap(~clarity)
Table and depth do not have much relationship with relative price. This makes sense: table and depth are highly correlated with carat (weight), and we've already taken out the effect of carat.
ggplot(diamonds2, aes(depth, rel_price)) +
  geom_bin2d()
ggplot(diamonds2, aes(table, rel_price)) +
  geom_bin2d()
Instead of using geom_line, we can draw a loess curve using geom_smooth:
txhousing
## # A tibble: 8,602 x 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000.000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.083
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.167
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.250
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.333
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.417
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.500
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2000.583
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2000.667
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2000.750
## # ... with 8,592 more rows
# De-season function: returns residuals after removing the monthly effect
deseas <- function(x, month) {
  resid(lm(x ~ factor(month), na.action = na.exclude))
}
txhousing <- txhousing %>%
  group_by(city) %>%
  mutate(rel_sales = deseas(log(sales), month))
ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_smooth(method = "loess", se = FALSE)
## Warning: Removed 568 rows containing non-finite values (stat_smooth).
## Warning: Removed 430 rows containing missing values (geom_path).
If you look closely (e.g. + xlim(2008, 2012)) at the long-term trend, you'll notice a weird pattern in 2009-2011. It looks like there was a big dip in 2010. Is this dip "real"? (i.e. can you spot it in the original data?)
Looking at the relative sales (after taking out the month effect), a number of cities do indeed tend to have a dip in 2010:
ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_line(stat = "summary", fun.y = "mean", color = "red") +
  xlim(2008, 2012)
## Warning: Removed 6377 rows containing non-finite values (stat_summary).
## Warning: Removed 6376 rows containing missing values (geom_path).
TODO: Maybe there are more missing data points in 2010? Check.
sales2010 <- txhousing %>% filter(year==2010)
sales2010
## # A tibble: 552 x 10
## # Groups: city [46]
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2010 1 73 9130783 112200 868 6.4 2010.000
## 2 Abilene 2010 2 93 10372904 98300 830 6.1 2010.083
## 3 Abilene 2010 3 133 16517713 114000 854 6.3 2010.167
## 4 Abilene 2010 4 161 18788002 103600 859 6.3 2010.250
## 5 Abilene 2010 5 200 22804393 99300 914 6.5 2010.333
## 6 Abilene 2010 6 169 23216943 127900 932 6.7 2010.417
## 7 Abilene 2010 7 159 22363123 127300 915 6.6 2010.500
## 8 Abilene 2010 8 144 17504580 122000 936 6.7 2010.583
## 9 Abilene 2010 9 116 15475763 121300 899 6.5 2010.667
## 10 Abilene 2010 10 111 14570529 111900 863 6.4 2010.750
## # ... with 542 more rows, and 1 more variables: rel_sales <dbl>
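One way to check the TODO above is to count the missing sales values by year (a quick sketch); if 2010 had unusually many NAs, the dip could be partly an artifact of missingness:

```r
txhousing %>%
  group_by(year) %>%
  summarize(n_missing = sum(is.na(sales))) %>%  # NA sales per year
  ggplot(aes(year, n_missing)) +
  geom_col()
```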
summary(txhousing)
## city year month sales
## Length:8602 Min. :2000 Min. : 1.000 Min. : 6.0
## Class :character 1st Qu.:2003 1st Qu.: 3.000 1st Qu.: 86.0
## Mode :character Median :2007 Median : 6.000 Median : 169.0
## Mean :2007 Mean : 6.406 Mean : 549.6
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.: 467.0
## Max. :2015 Max. :12.000 Max. :8945.0
## NA's :568
## volume median listings inventory
## Min. :8.350e+05 Min. : 50000 Min. : 0 Min. : 0.000
## 1st Qu.:1.084e+07 1st Qu.:100000 1st Qu.: 682 1st Qu.: 4.900
## Median :2.299e+07 Median :123800 Median : 1283 Median : 6.200
## Mean :1.069e+08 Mean :128131 Mean : 3217 Mean : 7.175
## 3rd Qu.:7.512e+07 3rd Qu.:150000 3rd Qu.: 2954 3rd Qu.: 8.150
## Max. :2.568e+09 Max. :304200 Max. :43107 Max. :55.900
## NA's :568 NA's :616 NA's :1424 NA's :1467
## date rel_sales
## Min. :2000 Min. :-1.8473
## 1st Qu.:2004 1st Qu.:-0.1404
## Median :2008 Median : 0.0117
## Mean :2008 Mean : 0.0000
## 3rd Qu.:2012 3rd Qu.: 0.1531
## Max. :2016 Max. : 0.9080
## NA's :568
#"Months inventory": amount of time it would take to sell all current listings at current pace of sales.
ggplot(txhousing, aes(date, log(inventory))) +
  geom_line(aes(group = city), alpha = 1/2)
## Warning: Removed 603 rows containing missing values (geom_path).
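If "months inventory" is roughly listings divided by the current monthly sales pace, we can sanity-check the column against that ratio (a rough check only; the published figure presumably uses a smoothed sales pace, so expect only approximate agreement):

```r
txhousing %>%
  filter(!is.na(inventory), !is.na(listings), !is.na(sales)) %>%
  mutate(inv_approx = listings / sales) %>%  # naive months-of-inventory estimate
  select(city, date, inventory, inv_approx) %>%
  head(10)
```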
#Total active listings
ggplot(txhousing, aes(date, log(listings))) +
  geom_line(aes(group = city), alpha = 1/2)
## Warning: Removed 518 rows containing missing values (geom_path).
#Total value of sales
ggplot(txhousing, aes(date, log(volume))) +
  geom_line(aes(group = city), alpha = 1/2)
## Warning: Removed 430 rows containing missing values (geom_path).
#Median sale price:
ggplot(txhousing, aes(date, log(median))) +
  geom_line(aes(group = city), alpha = 1/2)
## Warning: Removed 446 rows containing missing values (geom_path).
Use your dplyr skills to figure out how much data each city is missing. Display the results with a visualization.
numDates <- txhousing %>%
  summarize(nDates = n_distinct(date)) %>%
  select(nDates) %>%
  max()
numDates
## [1] 187
cityCnts <- txhousing %>%
  na.omit() %>%
  group_by(city) %>%
  summarize(nComplete = n()) %>%
  mutate(pctComplete = nComplete / numDates)
cityCnts
## # A tibble: 46 x 3
## city nComplete pctComplete
## <chr> <int> <dbl>
## 1 Abilene 186 0.9946524
## 2 Amarillo 182 0.9732620
## 3 Arlington 186 0.9946524
## 4 Austin 187 1.0000000
## 5 Bay Area 186 0.9946524
## 6 Beaumont 187 1.0000000
## 7 Brazoria County 129 0.6898396
## 8 Brownsville 82 0.4385027
## 9 Bryan-College Station 187 1.0000000
## 10 Collin County 186 0.9946524
## # ... with 36 more rows
ggplot(cityCnts, aes(city, pctComplete)) +
  geom_col() +
  labs(x = "City", y = "% Complete") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Replicate what stat_summary() did with dplyr so you can plot the data "by hand".
Comment: I think this question is referring to the geom_line shown on page 229 using stat = "summary" and fun.y = "mean".
meanRelSalesData <- txhousing %>%
  group_by(date) %>%
  summarize(meanRelSales = mean(rel_sales, na.rm = TRUE))
meanRelSalesData
## # A tibble: 187 x 2
## date meanRelSales
## <dbl> <dbl>
## 1 2000.000 -0.2625059
## 2 2000.083 -0.1535309
## 3 2000.167 -0.1775694
## 4 2000.250 -0.3273133
## 5 2000.333 -0.2263179
## 6 2000.417 -0.2185162
## 7 2000.500 -0.3047979
## 8 2000.583 -0.1751718
## 9 2000.667 -0.2235483
## 10 2000.750 -0.2263248
## # ... with 177 more rows
ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_line(aes(date, meanRelSales), data = meanRelSalesData, color = "red")
## Warning: Removed 430 rows containing missing values (geom_path).
#install.packages("broom")
library(broom)
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))
models
## Source: local data frame [46 x 2]
## Groups: <by row>
##
## # A tibble: 46 x 2
## city mod
## * <chr> <list>
## 1 Abilene <S3: lm>
## 2 Amarillo <S3: lm>
## 3 Arlington <S3: lm>
## 4 Austin <S3: lm>
## 5 Bay Area <S3: lm>
## 6 Beaumont <S3: lm>
## 7 Brazoria County <S3: lm>
## 8 Brownsville <S3: lm>
## 9 Bryan-College Station <S3: lm>
## 10 Collin County <S3: lm>
## # ... with 36 more rows
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups: city [46]
## city r.squared adj.r.squared sigma statistic
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 0.5298893 0.5003394 0.2817299 17.932066
## 2 Amarillo 0.4493699 0.4147588 0.3015624 12.983427
## 3 Arlington 0.5132337 0.4826369 0.2673618 16.774130
## 4 Austin 0.4873283 0.4551032 0.3102974 15.122641
## 5 Bay Area 0.5552242 0.5272669 0.2646016 19.859696
## 6 Beaumont 0.4304326 0.3946312 0.2754690 12.022792
## 7 Brazoria County 0.5082062 0.4746054 0.2919275 15.124817
## 8 Brownsville 0.1714455 0.1187629 0.4196615 3.254307
## 9 Bryan-College Station 0.6511249 0.6291956 0.4061146 29.692015
## 10 Collin County 0.6009165 0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(model_sum, aes(AIC, reorder(city,AIC))) +
geom_point()
Looking at AIC: Harlingen, San Marcos, and Brownsville have the highest AIC (two of the three are the same cities that had the lowest \(R^2\)). Dallas, El Paso, and Midland have the lowest AIC; these differ from the cities with the highest \(R^2\). (Recall that lower AIC is better.)
worstAIC <- c("Harlingen", "San Marcos", "Brownsville")
bestAIC <- c("Dallas", "El Paso", "Midland")
extreme <- txhousing %>%
  ungroup() %>%
  filter(city %in% c(worstAIC, bestAIC), !is.na(sales)) %>%
  mutate(city = factor(city, c(worstAIC, bestAIC)))
ggplot(extreme, aes(month, log(sales))) +
geom_line(aes(group=year)) +
facet_wrap(~city)
TODO: Why the change?
Looking at the plot on page 234, McAllen (low \(R^2\)) and Bryan-College Station (high \(R^2\)) have similar order of magnitude of sales. This is also apparent looking at a boxplot of their sales.
txhousing %>%
  group_by(city) %>%
  filter(city %in% c("McAllen", "Bryan-College Station"), !is.na(sales)) %>%
  summarize(meanSales = mean(sales))
## # A tibble: 2 x 2
## city meanSales
## <chr> <dbl>
## 1 Bryan-College Station 186.7433
## 2 McAllen 162.1730
selectCities <- txhousing %>%
filter(city %in% c("McAllen","Bryan-College Station"), !is.na(sales))
ggplot(selectCities, aes(city,sales)) + geom_boxplot()
If we look at all of the extreme cities, however, we do see that NE Tarrant County (high \(R^2\)) has significantly more sales. Also, Brownsville and Harlingen (both having low \(R^2\)) have fewer sales than the others.
selectCities <- txhousing %>%
  filter(
    city %in% c("McAllen", "Bryan-College Station", "Lubbock",
                "NE Tarrant County", "Brownsville", "Harlingen"),
    !is.na(sales)
  )
ggplot(selectCities, aes(city, sales)) + geom_boxplot()
log(sales) ~ factor(month) + year).
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month) + year, data = ., na.action = na.exclude))
models
## Source: local data frame [46 x 2]
## Groups: <by row>
##
## # A tibble: 46 x 2
## city mod
## * <chr> <list>
## 1 Abilene <S3: lm>
## 2 Amarillo <S3: lm>
## 3 Arlington <S3: lm>
## 4 Austin <S3: lm>
## 5 Bay Area <S3: lm>
## 6 Beaumont <S3: lm>
## 7 Brazoria County <S3: lm>
## 8 Brownsville <S3: lm>
## 9 Bryan-College Station <S3: lm>
## 10 Collin County <S3: lm>
## # ... with 36 more rows
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups: city [46]
## city r.squared adj.r.squared sigma statistic
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 0.6843255 0.6625548 0.2315246 31.433387
## 2 Amarillo 0.5313000 0.4989758 0.2790224 16.436631
## 3 Arlington 0.5773500 0.5482018 0.2498469 19.807350
## 4 Austin 0.6540551 0.6301968 0.2556268 27.414184
## 5 Bay Area 0.6608046 0.6374118 0.2317349 28.248216
## 6 Beaumont 0.5200054 0.4869023 0.2536079 15.708670
## 7 Brazoria County 0.5091646 0.4723519 0.2925529 13.831237
## 8 Brownsville 0.2355866 0.1822555 0.4042607 4.417429
## 9 Bryan-College Station 0.8510313 0.8407576 0.2661372 82.835859
## 10 Collin County 0.6889276 0.6674743 0.2353747 32.112942
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(model_sum, aes(r.squared, reorder(city,r.squared))) +
geom_point()
Harlingen moved up 10 places (comparing the \(R^2\) values) by adding year to the model.
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups: city [46]
## city r.squared adj.r.squared sigma statistic
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 0.5298893 0.5003394 0.2817299 17.932066
## 2 Amarillo 0.4493699 0.4147588 0.3015624 12.983427
## 3 Arlington 0.5132337 0.4826369 0.2673618 16.774130
## 4 Austin 0.4873283 0.4551032 0.3102974 15.122641
## 5 Bay Area 0.5552242 0.5272669 0.2646016 19.859696
## 6 Beaumont 0.4304326 0.3946312 0.2754690 12.022792
## 7 Brazoria County 0.5082062 0.4746054 0.2919275 15.124817
## 8 Brownsville 0.1714455 0.1187629 0.4196615 3.254307
## 9 Bryan-College Station 0.6511249 0.6291956 0.4061146 29.692015
## 10 Collin County 0.6009165 0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(txhousing, aes(month, log(sales))) +
geom_line(aes(group=year)) +
facet_wrap(~city)
TODO: How to order facets by \(R^2\)? Maybe join the datasets and then use a reorder statement.
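One way to answer the TODO: join the \(R^2\) values onto the data and reorder the city factor before faceting (a sketch using model_sum computed above):

```r
txhousing %>%
  left_join(model_sum %>% select(city, r.squared), by = "city") %>%
  ungroup() %>%
  mutate(city = reorder(city, -r.squared)) %>%  # best-fitting cities first
  ggplot(aes(month, log(sales))) +
  geom_line(aes(group = year)) +
  facet_wrap(~city)
```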
coef <- models %>% tidy(mod)
coef
## # A tibble: 552 x 6
## # Groups: city [46]
## city term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene (Intercept) 6.5420182 0.07043248 92.883541 7.898616e-151
## 2 Abilene factor(month)2 0.3537962 0.09960657 3.551937 4.914925e-04
## 3 Abilene factor(month)3 0.6747930 0.09960657 6.774584 1.826729e-10
## 4 Abilene factor(month)4 0.7489369 0.09960657 7.518950 2.758003e-12
## 5 Abilene factor(month)5 0.9164846 0.09960657 9.201046 1.056158e-16
## 6 Abilene factor(month)6 1.0024412 0.09960657 10.064007 4.372570e-19
## 7 Abilene factor(month)7 0.9539842 0.09960657 9.577523 9.810940e-18
## 8 Abilene factor(month)8 0.9337577 0.10125307 9.222019 9.259347e-17
## 9 Abilene factor(month)9 0.6036668 0.10125307 5.961961 1.342427e-08
## 10 Abilene factor(month)10 0.6084391 0.10125307 6.009093 1.055654e-08
## # ... with 542 more rows
How does strength of seasonal effect relate to the \(R^2\) for the model? Answer with a plot.
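One sketch of an answer: take the standard deviation of each city's monthly coefficients as a measure of seasonal strength, then plot it against \(R^2\) (using coef and model_sum from above; other strength measures, e.g. the range of the coefficients, would work too):

```r
seasonal_strength <- coef %>%
  filter(term != "(Intercept)") %>%  # keep only the month effects
  group_by(city) %>%
  summarize(coef_sd = sd(estimate))

seasonal_strength %>%
  left_join(model_sum %>% select(city, r.squared), by = "city") %>%
  ggplot(aes(coef_sd, r.squared)) +
  geom_point()
```

Since the model contains only month effects, we'd expect cities with stronger seasonality to be fit better, i.e. an upward trend in this plot.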
You should be extra cautious when your results agree with your prior beliefs. How can you confirm or refute my hypothesis about the causes of strong seasonal patterns?
A common diagnostic plot is fitted values (.fitted) vs. residuals (.resid). Do you see any patterns? What if you include the city or month on the same plot?
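A sketch of that diagnostic for a single city's model (Abilene, chosen arbitrarily), using broom::augment:

```r
abilene <- filter(txhousing, city == "Abilene", !is.na(sales))
mod_ab <- lm(log(sales) ~ factor(month), data = abilene)
augment(mod_ab) %>%  # adds .fitted and .resid columns
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")
```

To include month on the plot, map it to color, e.g. aes(.fitted, .resid, color = `factor(month)`).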
Create a time series of log(sales) for each city. Highlight points that have a standardized residual greater than 2.
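A sketch for one city (Abilene again); rstandard() gives standardized residuals, and a layer-level data function picks out the extreme points:

```r
abilene <- txhousing %>% filter(city == "Abilene", !is.na(sales))
mod_ab <- lm(log(sales) ~ factor(month), data = abilene)
abilene %>%
  ungroup() %>%
  mutate(std_resid = rstandard(mod_ab)) %>%
  ggplot(aes(date, log(sales))) +
  geom_line() +
  geom_point(data = function(d) filter(d, abs(std_resid) > 2),
             color = "red")  # flag unusually large residuals
```

Extending this to all cities would mean fitting per-city models (e.g. with do() as above) and binding the standardized residuals back onto the full data before plotting.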